{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-colab"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/pathwaycom/pathway/blob/main/examples/notebooks/tutorials/pagerank.ipynb\" target=\"_parent\"><img src=\"https://pathway.com/assets/colab-badge.svg\" alt=\"Run In Colab\" class=\"inline\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "notebook-instructions",
      "source": [
        "# Installing Pathway with Python 3.10+\n",
        "\n",
        "In the cell below, we install Pathway into a Python 3.10+ Linux runtime.\n",
        "\n",
        "> **If you are running in Google Colab, please run the colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.\n",
        "> \n",
        "> **The installation and loading time is less than 1 minute**.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "pip-installation-pathway",
      "source": [
        "%%capture --no-display\n",
        "!pip install --prefer-binary pathway"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "id": "logging",
      "source": [
        "import logging\n",
        "\n",
        "logging.basicConfig(level=logging.CRITICAL)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "id": "2",
      "metadata": {},
      "source": [
        "# Real-Time PageRank on Dynamic Graphs with Pathway\n",
        "\n",
        "## Introduction\n",
        "PageRank is best known for its success in ranking web pages in Google Search engine.\n",
        "Here is a [quick description](https://en.wikipedia.org/w/index.php?title=PageRank&oldid=1111494883):\n",
        "> PageRank works by counting the number and quality of links to a page to determine a\n",
        "> rough estimate of how important the website is. The underlying assumption is that\n",
        "> more important websites are likely to receive more links from other websites.\n",
        "\n",
        "In fact, the algorithm outputs a probability distribution that represents the\n",
        "likelihood of arriving at any particular page after randomly clicking on links for a while.\n",
        "We will simulate this behavior by the following 'surfing the Internet' procedure:\n",
        "- Initially, at each page, some amount of people start surfing the internet from that page.\n",
        "- In each turn, some users decide to click on a random link and visit a new page.\n",
        "- We iterate for a fixed number of rounds.\n",
        "\n",
        "This article assumes that you are already familiar with some basics of [Pathway transformations](/developers/user-guide/introduction/concepts#processing-the-data-with-transformations).\n",
        "\n",
        "## Code\n",
        "First things first - imports and constants."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3",
      "metadata": {},
      "outputs": [],
      "source": [
        "from typing import Any\n",
        "\n",
        "import pathway as pw\n",
        "\n",
        "# To use advanced features with Pathway Scale, get your free license key from\n",
        "# https://pathway.com/features and paste it below.\n",
        "# To use Pathway Community, comment out the line below.\n",
        "pw.set_license_key(\"demo-license-key-with-telemetry\")\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "4",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "### I/O Data\n",
        "We use an `Edge` schema to represent the graph and `Result` schema to represent the final ranks."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5",
      "metadata": {},
      "outputs": [],
      "source": [
        "class Edge(pw.Schema):\n",
        "    u: Any\n",
        "    v: Any\n",
        "\n",
        "\n",
        "class Result(pw.Schema):\n",
        "    rank: float"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "`pagerank` performs one turn of 'surfing the Internet' procedure by uniformly\n",
        "distributing rank from each node to all its adjacent nodes, for a fixed number of rounds.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7",
      "metadata": {},
      "outputs": [],
      "source": [
        "def pagerank(edges: pw.Table[Edge], steps: int = 5) -> pw.Table[Result]:\n",
        "    in_vertices = edges.groupby(id=edges.v).reduce(degree=0)\n",
        "    out_vertices = edges.groupby(id=edges.u).reduce(degree=pw.reducers.count())\n",
        "    degrees = pw.Table.update_rows(in_vertices, out_vertices)\n",
        "    base = out_vertices.difference(in_vertices).select(flow=0)\n",
        "\n",
        "    ranks = degrees.select(rank=6_000)\n",
        "\n",
        "    grouper = edges.groupby(id=edges.v)\n",
        "\n",
        "    for step in range(steps):\n",
        "        outflow = degrees.select(\n",
        "            flow=pw.if_else(\n",
        "                degrees.degree == 0, 0, (ranks.rank * 5) // (degrees.degree * 6)\n",
        "            )\n",
        "        )\n",
        "\n",
        "        inflows = edges.groupby(id=edges.v).reduce(\n",
        "            flow=pw.reducers.sum(outflow.ix(edges.u).flow)\n",
        "        )\n",
        "\n",
        "        inflows = pw.Table.concat(base, inflows)\n",
        "\n",
        "        ranks = inflows.select(rank=inflows.flow + 1_000).with_universe_of(degrees)\n",
        "    return ranks"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8",
      "metadata": {
        "lines_to_next_cell": 0
      },
      "source": [
        "### Tests\n",
        "We present two easy test cases here.\n",
        "A test case with a single 3-vertices loop with one backward edge."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "            | rank\n",
            "^HJRW9JY... | 3945\n",
            "^4DM90AW... | 6981\n",
            "^G8TD95H... | 7069\n"
          ]
        }
      ],
      "source": [
        "# directed graph\n",
        "vertices = pw.debug.table_from_markdown(\n",
        "    \"\"\"\n",
        "      |\n",
        "    a |\n",
        "    b |\n",
        "    c |\n",
        "    \"\"\"\n",
        ").select()\n",
        "edges = pw.debug.table_from_markdown(\n",
        "    \"\"\"\n",
        "    u | v\n",
        "    a | b\n",
        "    b | c\n",
        "    c | a\n",
        "    c | b\n",
        "    \"\"\",\n",
        ").select(u=vertices.pointer_from(pw.this.u), v=vertices.pointer_from(pw.this.v))\n",
        "\n",
        "pw.debug.compute_and_print(pagerank(edges))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "10",
      "metadata": {},
      "source": [
        "Why these numbers? 3945, 6981, 7069? Feel free to skip the quite mathy explanation below.\n",
        "\n",
        "Let us calculate what the correct answer should be.\n",
        "PageRank actually finds a [stationary distribution](https://en.wikipedia.org/wiki/Markov_chain#Stationary_distribution_relation_to_eigenvectors_and_simplices)\n",
        "of a random walk on a graph in which the probability of each move depends only on the\n",
        "currently visited state, i.e. it is a Markov Chain.\n",
        "\n",
        "One may think that the transition matrix of the Markov chain in our example is\n",
        "$$\n",
        "P=\\left(\\begin{array}{cc}\n",
        "0.05 & 0.9 & 0.05\\\\\n",
        "0.05 & 0.05 & 0.9\\\\\n",
        "0.475 & 0.475 & 0.05\n",
        "\\end{array}\\right)\n",
        "$$\n",
        "We move to a new page with probability 5/6 uniformly distributed among all the linked (adjacent) pages,\n",
        "and with probability 1/6 we mix uniformly at random.\n",
        "The result is a stationary distribution roughly of $(x = ( 0.215 \\quad  0.397 \\quad 0.388) )$ which is proportional to the rank returned.\n",
        "However, we output only the approximation of this result, and our output is not normalized.\n",
        "\n",
        "### Summary\n",
        "As always, feel free to play and experiment with this code! In case you are looking for cool real-world\n",
        "graphs to experiment with, the [Stanford Network Analysis Project](https://snap.stanford.edu/) is an excellent source\n",
        "of reference instances, big and small."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}